Skip to content

Conversation

@navidcy
Copy link
Member

@navidcy navidcy commented Oct 9, 2025

No description provided.

@navidcy navidcy added the testing 🧪 Tests get priority in case of emergency evacuation label Oct 9, 2025
@simone-silvestri
Copy link
Collaborator

I am very interested in this. Let's hope it works and we can move on from julia 1.10

@simone-silvestri
Copy link
Collaborator

I am disabling the reactant tests for the moment to check if the rest works.

@simone-silvestri
Copy link
Collaborator

If docs still break on the internal_tide.jl example NaNing, I would try with a GridFittedBottom instead of a PartialCellBottom, which is not even tested, and, if tests pass, merge this PR and move forward trying to assess what the problems with partial cells are.

@simone-silvestri
Copy link
Collaborator

Seems that we are hitting the same NaN issue on the internal tide example

@navidcy
Copy link
Member Author

navidcy commented Oct 9, 2025

Seems that we are hitting the same NaN issue on the internal tide example

the ghosts of the past still haunt us....

@simone-silvestri
Copy link
Collaborator

Apparently also GridFittedBottom fails...

@simone-silvestri
Copy link
Collaborator

If I run the example locally, it works. Why would it error on CI? Do we have a way to reproduce this error locally?

@ali-ramadhan
Copy link
Member

If I run the example locally, it works. Why would it error on CI? Do we have a way to reproduce this error locally?

One thing to try might be to run the example locally and on CI using the exact same Manifest.toml if possible. We can commit a Manifest.toml to this branch for debugging. I can't think of which dependency would lead to such a big difference but it's one thing we can control for.

@navidcy
Copy link
Member Author

navidcy commented Oct 16, 2025

From the Julia v1.11 chat I recall that the error was showing up only for unix, not for mac?

@giordano

This comment was marked as resolved.

@giordano

This comment was marked as resolved.

@giordano

This comment was marked as resolved.

@giordano
Copy link
Collaborator

giordano commented Oct 27, 2025

The plot thickens: it works correctly in Julia v1.12 on Ampere eMAG (aarch64) with AlmaLinux 8.10 as operating system, which rules out an operating system difference. aarch64 is also the architecture on macOS, so I'm starting to suspect there's an architecture dependence. Can someone point me to the operation performed on the u field? Is there any chance there you're running any BLAS operation? Edit: using in x86-64 Julia v1.12 the libopenblas shipped in Julia v1.10 doesn't fix the issue, so that'd rule out BLAS version differences. Edit 2: also works on Intel(R) Xeon(R) CPU X5670 with CentOS 8 Stream, I'm getting more and more confused, sigh. Pointers to the relevant code would still be very welcome.

@glwagner
Copy link
Member

The plot thickens: it works correctly in Julia v1.12 on Ampere eMAG (aarch64) with AlmaLinux 8.10 as operating system, which rules out an operating system difference. aarch64 is also the architecture on macOS, so I'm starting to suspect there's an architecture dependence. Can someone point me to the operation performed on the u field? Is there any chance there you're running any BLAS operation? Edit: using in x86-64 Julia v1.12 the libopenblas shipped in Julia v1.10 doesn't fix the issue, so that'd rule out BLAS version differences. Edit 2: also works on Intel(R) Xeon(R) CPU X5670 with CentOS 8 Stream, I'm getting more and more confused, sigh. Pointers to the relevant code would still be very welcome.

Nice work so far though!!

The entire time-step is a complex chain of operations. I do think it is a good start to save down all fields every time-step. We may find that differences arise in one field versus another. Note that the NaNChecker checks u only as a proxy for the entire state. Here the prognostic state should be model.velocities (u, v, w) and model.tracers.b.

@glwagner
Copy link
Member

glwagner commented Oct 27, 2025

To save every iteration chnage this line

schedule = TimeInterval(save_fields_interval),

to schedule = IterationInterval(1). Also I think we should add v (I see u and w but not v there).

The difference should arise in the very first time-step? We could compare those. It seems annoying laborious to do this across architectures, but maybe @giordano you have good ideas how to do this efficiently

@simone-silvestri
Copy link
Collaborator

Shall we just move the tests to julia 1.11 then until Reactant and Enzyme are ready?

@giordano
Copy link
Collaborator

giordano commented Nov 4, 2025

How easy would it be to run Reactant/Enzyme tests only with 1.11 in Buildkite? It shouldn't be hard in github actions, but I'm less familiar with buildkite.

@simone-silvestri
Copy link
Collaborator

simone-silvestri commented Nov 4, 2025

I can try giving it a go. We can actually add another pipeline, and probably need a new queue. @ali-ramadhan would you know how to set this up on nautilus if I separate out the tests in a new pipeline?

@simone-silvestri
Copy link
Collaborator

I think I have managed to separate them out.

@simone-silvestri

This comment was marked as resolved.

@giordano

This comment was marked as resolved.

Co-authored-by: Mosè Giordano <[email protected]>
@simone-silvestri

This comment was marked as resolved.

@giordano

This comment was marked as resolved.

@simone-silvestri

This comment was marked as resolved.

giordano

This comment was marked as resolved.

@giordano

This comment was marked as resolved.

@giordano giordano force-pushed the ncc/julia-v1.12 branch 5 times, most recently from 67515b7 to e81db39 Compare November 5, 2025 00:07
@giordano
Copy link
Collaborator

giordano commented Nov 5, 2025

Alright, I believe I found the right incantation in 94bc181.

The summary is

  • the step 🏕️ initialize general environment was instantiating the first environment with Julia v1.12, creating the Manifest.toml
  • the step 🏕️ initialize enzyme environment would try to use the Manifest.toml generated by the previous step (the two steps happen one after the other, in the same machine & directory) with Julia v1.11, failing.

My solution was to

  • rename the manifest generated by the first step to the version-specific name after instantiation
  • also rename the Enzyme manifest to the version-specific name after instantiation. This is strictly unnecessary if the previous step is already disambiguating the manifests, but having version-specific names everywhere is a more robust approach anyway.

@navidcy
Copy link
Member Author

navidcy commented Nov 5, 2025

The docs built and the internal tide example looks great!
https://clima.github.io/OceananigansDocumentation/previews/PR4836/literated/internal_tide/

There seem to be some errors in the GPU abstract operations CI:

https://buildkite.com/clima/oceananigans/builds/26558#019a515f-fc97-4a52-96e9-a8a5894dffdd/7-53

If that smooths out, I'll work on bringing the doctests back in!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

testing 🧪 Tests get priority in case of emergency evacuation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants